STAT 331: Final Project

Author

Ethan Handelman, Maxwell Dubow, Aaron Eliscu

Setup

Code
library(tidyverse)
library(knitr)
library(kableExtra)
library(ggplot2)
library(plotly)
library(gganimate)
library(gifski)
library(png)
library(countrycode)
library(broom)

life_expectancy <- read_csv("Data/lex.csv")
coal_per_cap <- read_csv("Data/coal_consumption_per_cap.csv",
           na = c("NA", ""))

Pivoting/Joining the Data

Code
life_expectancy <- life_expectancy |>
  pivot_longer(cols = 2:last_col(),
    names_to = "year",
    values_to = "infant_lex") |>
  drop_na(infant_lex)

coal_per_cap <- coal_per_cap |>
  mutate(across(.cols = 2:last_col(), as.double)) |>
  pivot_longer(cols = 2:last_col(),
               names_to = "year",
               values_to = "coal_cons_per_cap") |>
  drop_na(coal_cons_per_cap)

lex_coal <- coal_per_cap |>
  left_join(life_expectancy, by = c("country", "year")) |>
  filter(!is.na(coal_cons_per_cap), !is.na(infant_lex)) |>
  mutate(year = as.integer(year),
         continent = countrycode(sourcevar = country,
                                 origin    = "country.name",
                                 destination = "continent")) 

Data Description

In this investigative report we will be exploring the relationship between two quantitative variables across countries/years:

The data on coal consumption per capita comes from BP Statistical Review’s report on global energy markets over time. The data we are interested in is the amount of coal consumption per country per year. The data set, which we downloaded from Gapminder contains countries as observations and coal consumption in a specific year as variables. We transformed the data so that coal consumption was the only variable and each observation was a country and year.

The data on infant life expectancy comes from Gapminder’s own data collection from various different sources. It contains information on the life expectancy at birth in each country and year. The data set, which we downloaded from Gapminder contains countries as observations and infant life expectancy in a specific year as variables. We transformed the data so that life expectancy was the only variable and each observation was a country and year.

Finally, we combined these two data sets so that for each observation (country and year), we had coal consumption and infant life expectancy as our two variables. There were a few missing values in the data, which we ignored in our analysis.

Hypothesis

We hypothesize that infant life expectancy will have a negative association with coal consumption after adjusting for year. This is because burning coal releases many by products into the atmosphere that form toxic chemicals, and “continuous inhalation of these hazardous substances triggers many diseases such respiratory and cardiovascular disease, systemic inflammation, and neurodegeneration” (Gasparotto and Da Boit Martinello 2021). However, with the elimination of a lot of the infectious disease deaths that affected young children and the better understanding of treading cardiovascular conditions and cancer, life expectancy has drastically increased each year (Crimmins 2015). Therefore, we will have to adjust for this effect in our analysis to see if there is a negative association between coal consumption and infant life expectancy like we expect.

Data Visualization

Code
lex_coal_cont <- lex_coal |>
  filter(!is.na(coal_cons_per_cap), !is.na(infant_lex)) |>
  
  group_by(continent, country) |>
  summarize(mean_coal = mean(coal_cons_per_cap),
            mean_infant = mean(infant_lex),
            .groups = "drop") |>
  filter(mean_coal != 0 & mean_infant != 0) |>
  mutate(log_mean_coal = log(mean_coal),
         log_mean_infant = log(mean_infant))

p <- lex_coal_cont |>
  ggplot(mapping = aes(x = log_mean_coal, 
                       y = log_mean_infant, 
                       label = country,
                       color = continent,
                       text = paste0("<b>Country:</b> ", 
                                     country, 
                                     "<br>"))) +
  geom_point() +
  geom_smooth(
    method = "lm",
    se = FALSE,
    color = "black",
    linetype = "dashed",
    inherit.aes = FALSE,
    mapping = aes(x = log_mean_coal, y = log_mean_infant)
  ) +
  labs(x = "Log Mean Coal Consumption (tonnes oil equivalent)",
       y = "Log Infant Life Expectancy (years)",
       title = "Infant Life Expectancy vs Coal Consumption by Country") + theme(axis.title = element_text(size = 10),
                                                                                plot.title = element_text(hjust = 0.5))

ggplotly(p, tooltip = "text") |>
  layout(
    margin = list(t = 120),
    title = list(
      text = "Infant Life Expectancy vs Coal Consumption by Country",
      pad = list(t = 40),    
      x = 0.5,               
      xanchor = "center"     
    )
  )

The above graph shows the relationship between mean coal consumption and mean infant life expectancy. We can see that a 2-fold increase in the mean coal consumption per capita is associated with an increase in the predicted mean infant life expectancy by a factor of 1.012. Additionally, we can see that on average, the mean infant life expectancy in Oceania, Europe and the Americas is higher than the mean infant life expectancy in Asia and Africa. Finally, we can see that South Africa appears to be an outlier, with one of the highest observed mean coal consumption per capita but the lowest mean infant life expectancy.

Code
p_anim <- lex_coal |>
  filter(coal_cons_per_cap != 0 & infant_lex != 0) |>
  mutate(char_nudge = nchar(country) * -0.07) |>
  ggplot(aes(x = log(coal_cons_per_cap), 
             y = log(infant_lex),
             color = continent,
             group = country)) +
  geom_point(alpha = 0.7, 
             size = 2) +
  labs(
    title = "Infant Life Expectancy vs. Coal Consumption",
    subtitle = "Year: {frame_time}",
    x = "Log Coal Consumption per Capita (tonnes oil equivalent)",
    y = "Log Infant Life Expectancy (years)"
  ) +
  geom_text(aes(x = log(coal_cons_per_cap) + char_nudge,
                label = country),
            family = "mono",
            check_overlap = TRUE) +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal(base_size = 14) +
  theme(
    plot.subtitle = element_text(size = 16, face = "bold"),
    legend.position = "right"
  ) +
  transition_time(year) +
  ease_aes("linear")

anim <- animate(
  p_anim,
  nframes = length(unique(lex_coal$year)) * 5,
  fps = 10,
  width = 800,
  height = 600,
  renderer = gifski_renderer(loop = TRUE)
)

anim

anim_save("coal_infant_by_continent2.gif", animation = anim)

Animated Relationship

The above graph shows that over time, both the infant life expectancy and coal consumpiton per capita has increased in all countries. We can see a big increase in coal consumption per capita across many Asian countries.

Linear Regression Models

We used a linear regression model to predict the log mean infant life expectancy (years) form the log mean coal consumption per capita (tonnes oil equivalent).

Code
model <- lm(log_mean_infant ~ log_mean_coal, data = lex_coal_cont)
coefs <- broom::tidy(model)

knitr::kable(
  coefs,
  digits  = 3,
  caption = "Regression Coefficients\n(log Mean Infant Life Expectancy ∼ log Mean Coal Consumption)",
  col.names = c("Term", "Estimate", "Standard Error", "t value", "p value")
) |>
  kableExtra::kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Regression Coefficients (log Mean Infant Life Expectancy ∼ log Mean Coal Consumption)
Term Estimate Standard Error t value p value
(Intercept) 4.304 0.010 417.595 0
log_mean_coal 0.017 0.004 4.329 0

Variance Summary

Code
var_response <- var(lex_coal_cont$log_mean_infant)
var_fitted <- var(fitted(model))
var_residuals <- var(residuals(model))
r_squared <- var_fitted / var_response
summary_table <- tibble(
  Metric = c("Variance of Response (A)", 
             "Variance of Fitted Values (B)", 
             "Variance of Residuals", 
             "Model R² (B/A)"),
  Value = c(var_response, var_fitted, var_residuals, r_squared)
)

summary_table |>
  kable(format = "html", digits = 4, caption = "Regression Model Variance Summary") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)
Regression Model Variance Summary
Metric Value
Variance of Response (A) 0.0053
Variance of Fitted Values (B) 0.0011
Variance of Residuals 0.0042
Model R² (B/A) 0.2042

Analysis

In our model, our R-squared value is 0.2042. As a result, approximately 20.4% of the variance in log mean infant mortality rate is explained by the variance in log mean coal consumption per capita. Since our p value is below 0.05, we are able to say that the relationship between infant mortality rate and coal exposure is statistically significant. In conclusion, our model explains about one fifth of the variability in infant mortality.

Improved Model

To investigate the effect of the progression of medical technology and its effect on year, we add the variable year to our model.

Code
model <- lm(log(infant_lex) ~ log(coal_cons_per_cap) + year, 
            data = lex_coal |> filter(coal_cons_per_cap != 0))
coefs <- broom::tidy(model)

knitr::kable(
  coefs,
  digits  = 3,
  caption = "Regression Coefficients\n(log Mean Infant Life Expectancy ∼ log Mean Coal Consumption)",
  col.names = c("Term", "Estimate", "Standard Error", "t value", "p value")
) |>
  kableExtra::kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Regression Coefficients (log Mean Infant Life Expectancy ∼ log Mean Coal Consumption)
Term Estimate Standard Error t value p value
(Intercept) -2.117 0.163 -12.981 0
log(coal_cons_per_cap) 0.015 0.001 25.362 0
year 0.003 0.000 39.400 0

Variance Summary

Code
var_response <- var(lex_coal_cont$log_mean_infant)
var_fitted <- var(fitted(model))
var_residuals <- var(residuals(model))
r_squared <- var_fitted / var_response
summary_table <- tibble(
  Metric = c("Variance of Response (A)", 
             "Variance of Fitted Values (B)", 
             "Variance of Residuals", 
             "Model R² (B/A)"),
  Value = c(var_response, var_fitted, var_residuals, r_squared)
)

summary_table |>
  kable(format = "html", digits = 4, caption = "Regression Model Variance Summary") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)
Regression Model Variance Summary
Metric Value
Variance of Response (A) 0.0053
Variance of Fitted Values (B) 0.0038
Variance of Residuals 0.0058
Model R² (B/A) 0.7185

Analysis

In our new model, our R-squared value is 0.7185. As a result, approximately 71.85% of the variance in log infant mortality rate is explained by the variance in log coal consumption per capita and year. Since our p value is below 0.05, we are able to say that the relationship between infant mortality rate and coal exposure is statistically significant. In conclusion, our model explains about one fifth of the variability in infant mortality.

While the estimated slope coefficient of coal log consumption per capita is smaller, it is still positive, indicating that an increase in coal consumption leads to an increase in predicted infant life expectancy. This goes against what we hypothesized, though we suspect that the reason for the discrepancy is confounding variables like GDP. More analysis is needed to understand why coal consumption per capita is associated with an increase in infant life expectancy.

Cross Validation

Performing the Validation

Code
set.seed(2025)

# determine number of folds
n <- nrow(lex_coal_cont)
k <- floor(n / 10)

# assign rows to folds randomly
fold_id <- sample(rep(1:k, length.out = n))

#function for fitting
compute_fold_r2 <- function(fold, data, fold_id) {
  train_df <- data[fold_id != fold, ]
  test_df  <- data[fold_id == fold, ]
  mod      <- lm(log_mean_infant ~ log_mean_coal, data = train_df)
  preds    <- predict(mod, newdata = test_df)
  var(preds) / var(test_df$log_mean_infant)
}

# perform the cross validation
cv_results <- tibble(fold = 1:k) |>
  mutate(
    r2 = map_dbl(fold, compute_fold_r2,
                 data = lex_coal_cont,
                 fold_id = fold_id)
  )

cv_results |>
  kable(format = "html", digits = 4, caption = "K Fold R-Squared Values") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)
K Fold R-Squared Values
fold r2
1 0.3615
2 0.2025
3 0.2123
4 0.0373
5 0.1006
6 0.3745
7 0.3721

Visualization

Code
mean_r2 <- mean(cv_results$r2)

ggplot(cv_results, aes(x = r2)) +
  geom_histogram(binwidth = 0.02, color = "black", fill = "lightblue") +
  geom_vline(xintercept = mean_r2, linetype = "dashed", size = 1) +
  labs(
    title = "Distribution of Cross-Validated R^2 Values",
    subtitle = paste0("k = ", k, " folds; mean R² = ", round(mean_r2, 3)),
    x = expression(R^2),
    y = "Count"
  ) +
  theme_minimal(base_size = 14)

References

Crimmins, Eileen M. 2015. “Lifespan and Healthspan: Past, Present, and Promise.” The Gerontologist 55 (6): 901–11. https://doi.org/10.1093/geront/gnv130.
Gasparotto, Juciano, and Kátia Da Boit Martinello. 2021. “Coal as an Energy Source and Its Impacts on Human Health.” Energy Geoscience 2 (2): 113–20. https://doi.org/10.1016/j.engeos.2020.07.003.